Abstract

The cyanobacterium Synechocystis sp. PCC 6803 was the first photosynthetic organism whose genome
sequence was determined, in 1996 (Kazusa strain). It thus plays an important role in basic research on the
mechanism, evolution, and molecular genetics of the photosynthetic machinery. There are many substrains
or laboratory strains derived from the original Berkeley strain, including glucose-tolerant (GT)
strains. To establish reliable genomic sequence data for this cyanobacterium, we resequenced
the genomes of three substrains (GT-I, PCC-P, and PCC-N) and compared the data with those
of the original Kazusa strain stored in the public database. We found that each substrain has sequence
differences, some of which are likely to reflect specific mutations that may contribute to its altered
phenotype. Our resequencing data for the PCC substrains, along with the proposed corrections/refinements
of the sequence data for the Kazusa strain and its derivatives, are expected to contribute to investigations
of the evolutionary events in the photosynthetic and related systems that have occurred in Synechocystis
as well as in other cyanobacteria.

1. Introduction

Cyanobacteria are capable of oxygenic photosynthesis; they are thought to be the
progenitors of plant plastids. Synechocystis sp. PCC 6803 is one of the most widely
used cyanobacterial species for genetic studies for several major reasons: (i) it is
naturally competent, incorporating exogenous DNA that is integrated into the genome
by homologous recombination at high frequency; 1-3 (ii) it grows heterotrophically
in the presence of glucose; 3,4 and (iii) the entire genome sequence was determined
early on by Kaneko et al. 5 The availability of the entire genome sequence facilitated
post-genomic investigations such as transcriptome, proteome, and functional genomics
studies. 6

The original strain of Synechocystis was isolated from California freshwater by
Kunisawa and colleagues and called the Berkeley strain; 7 it was deposited in the
Pasteur Culture Collection (PCC strain) and the American Type Culture Collection
(ATCC strain). Williams 3 subsequently isolated the glucose-tolerant (GT) strain
from the ATCC strain. 3 The Kazusa strain, whose genome sequence was published in
1996, 5 is a derivative of a GT strain. A single representative clone of the GT
strain was established for complete genome sequencing as the Kazusa strain; other
strains were maintained and transferred without further cloning such as single
colony isolation. 8 Therefore, four substrains, the PCC, ATCC, GT, and Kazusa
strains, all derived from the original Berkeley strain, were distributed to a
number of laboratories, although all of them were grouped together under the name
Synechocystis sp. PCC 6803. 9 Ikeuchi and Tabata 8 reported that each substrain had
specific mutations such as single nucleotide polymorphisms (SNPs) and indels, and
some exhibited a specific phenotype. Some of the mutated loci that differ from the
database sequence derived from the Kazusa strain have been identified, such as an
SNP, 10 indels, 6,11-13 and IS mobilization. 14 However, the total number of
mutations in the whole genome of each strain remained unknown, and sequence
variations in these major strains can be expected to raise problems in the
evaluation of phenotypes of mutants constructed from these strains. The history of
these major substrains and of additional substrains isolated as single colonies
from the PCC and GT strains (GT-I strain; the standard strain in Dr Ikeuchi's
group) was summarized by Ikeuchi and Tabata. 8 The single colonies isolated from
the PCC strain were designated the PCC-P (positive phototaxis) strain and the PCC-N
(negative phototaxis) strain based on the direction of phototactic movement. 15 A
derivative of the GT-I strain that acquired high light tolerance and a
glucose-sensitive phenotype was designated the WL strain, which has an SNP in the
pmgA gene. 16,17 Thus, there are two fundamental problems for post-genomic research
in bacterial molecular genetics. One is the heterogeneity of cells in the frozen
stock of the culture collection centres; the other is the frequent spontaneous
mutation in bacterial genomes, an event that may be unavoidable during the long
cultivation of bacterial cells. As revealed in Bacillus subtilis, whole-genome
resequencing is a powerful solution for obtaining the sequence information of such
spontaneous mutants. 18

Without question, laboratories should start their post-genomic research with genome
sequence data of the 'reference' or 'standard' strain. We deciphered the genomes of
three substrains of Synechocystis sp. PCC 6803, i.e. PCC-P, PCC-N, and GT-I, to
reconstruct the informatics basis of the molecular biology of Synechocystis sp. PCC
6803. We identified a number of SNPs and indels in these substrains and introduced a
strategy to identify the mutated loci on a genome-wide level using the massively
parallel sequencer. In particular, determination of the genome sequences of the PCC
substrains will contribute widely to cyanobacterial research that uses frozen stock
cells supplied by the PCC.



2. Materials and Methods 

2.1. Bacterial strains and genomic DNA

Synechocystis PCC-P, PCC-N, and GT-I strains were maintained as frozen stocks in the
laboratory of Dr Masahiko Ikeuchi at The University of Tokyo, Japan. The PCC-P and
PCC-N strains, which exhibited positive- or negative-direction movement under
phototaxis test conditions, respectively, were isolated by Yoshihara et al. 19 as
single colonies from frozen stock obtained from the French Pasteur Culture
Collection (PCC strain; see the catalogue of strains 9). Genomic DNA was extracted
with the hot-phenol method. 16



2.2. Sequencing methods 

DNA was uniformly sheared into 300-bp fragments using Adaptive Focused Acoustics
(Covaris Inc., Woburn, MA, USA). We constructed a DNA library with a median insert
size of 300 bp for a paired-end read format. The quality of the DNA library was
checked with the Sanger method by Escherichia coli transformation of aliquots of the
library solution. The library was sequenced on a Genome Analyzer II (Illumina Inc.,
San Diego, CA, USA). Sample preparation, cluster generation, and 50-base paired-end
sequencing were performed according to the manufacturer's protocols with minor
modifications (Illumina paired-end cluster generation kit GAII ver. 2, 36-cycle
sequencing kit ver. 3), with the multiplex method using a single lane of the
eight-lane flow cell. Image analysis and ELAND alignment were performed with
Illumina's Pipeline Analysis software ver. 1.6. Sequences passing the standard
Illumina GA pipeline filters were retained.

2.3. Mapping analyses using short-read sequences
For short-read alignment and variant calling (SNPs and indels), we used the
short-read mapping software MAQ ver. 0.7.1, 20 BWA ver. 0.5.1, 21 and SAMtools
ver. 0.1.9. 22 MAQ alignments were done using the 'easyrun' option of the maq.pl
script with the default parameter settings. SNP filtering was performed using the
default parameters except for the minimum consensus quality for SNPs (-q 40). BWA
alignments and the subsequent variant calling using SAMtools were done with the
default parameter settings. Finally, we applied the following filtering criteria to
the lists of SNPs/indels: minimum read depth for SNP calling = 3, minimum read
depth for indel calling = 10, and a 60% cut-off for the percentage of aligned reads
calling the SNP/indel per total mapped reads at the non-reference allele sites. We
also used BWA to estimate the sequence read depth affecting the coverage and
accuracy of the variant calls. Structural variations were identified using
BreakDancer 23 with default parameters.
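
The final filtering criteria stated above (read depth of at least 3 for SNPs, at least 10 for indels, and at least 60% of mapped reads supporting the non-reference allele) can be illustrated with a minimal Python sketch. This is not the authors' script: the `Call` record, its field names, and the example positions and read counts are hypothetical and serve only to show how the three thresholds interact.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One candidate variant (hypothetical record layout)."""
    position: int   # 1-based reference coordinate
    kind: str       # "SNP" or "indel"
    depth: int      # total reads mapped at the site
    alt_reads: int  # reads supporting the non-reference allele


# Thresholds taken from the text of section 2.3.
MIN_DEPTH = {"SNP": 3, "indel": 10}
MIN_ALT_FRACTION = 0.60


def passes_filter(call: Call) -> bool:
    """Apply the depth threshold first, then the 60% allele-fraction cut-off."""
    if call.depth < MIN_DEPTH[call.kind]:
        return False
    return call.alt_reads / call.depth >= MIN_ALT_FRACTION


# Made-up candidate calls, for illustration only.
calls = [
    Call(1000, "SNP", 70, 68),    # deep, almost all reads agree
    Call(2000, "SNP", 2, 2),      # read depth below 3
    Call(3000, "indel", 8, 8),    # indel read depth below 10
    Call(4000, "SNP", 65, 30),    # only about 46% of reads support the SNP
]
kept = [c.position for c in calls if passes_filter(c)]
print(kept)
```

Only the first hypothetical call survives; the others fail on depth or on the allele-fraction cut-off.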



2.4. Mapping analyses using contigs assembled de novo

Read sequences were assembled de novo with the Velvet assembly programme. 24 To
optimize the hash value of the assembly process we used the N50 size. The de novo
assembled contigs were mapped onto the database genome sequence derived from the
Kazusa strain using the MUMmer sequence alignment package 25 with default settings.
We then employed the show-snps function of the MUMmer program to produce lists of
SNPs/indels, to which the filtering criteria described above were applied.
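
As a hedged illustration of using the N50 size to choose the hash value, the sketch below computes N50 for several candidate assemblies and picks the hash value with the largest N50. The hash values and contig lengths are invented for the example, and the selection rule (largest N50 wins) is an assumption about how the optimization could be carried out, not a description of the authors' procedure.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (bp) from assemblies run with three hash values.
assemblies = {
    21: [40, 30, 30, 20],
    25: [80, 70, 50, 40, 30, 20],
    31: [60, 60, 40],
}

best_hash = max(assemblies, key=lambda k: n50(assemblies[k]))
print(best_hash, n50(assemblies[best_hash]))
```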

2.5. Annotation and creation of the SNP/indel list
The list of SNPs/indels was then annotated with the in-house developed software
Variant Annotator (VA), which was specifically designed to check amino acid
substitutions attributable to the large number of identified SNPs/indels using
GenBank annotation files. We used GenBank, RefSeq, the cyanobacterial database
CyanoBase (http://genome.kazusa.or.jp/cyanobase), CyanoClust, 26 and also ORF
information of the GT-S strain, 27 which is the recently resequenced substrain of
Synechocystis sp. PCC 6803, for precise annotation of each ORF.
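
The kind of check described above, deciding whether an SNP inside an ORF changes the encoded amino acid, can be sketched as follows. This is not the Variant Annotator software, whose internals are not described here. The function name, the toy coding sequence, and the simplifying assumptions (the gene lies on the forward strand, the standard genetic code applies, and the SNP position is given as a 0-based offset within the coding sequence) are all illustrative.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def amino_acid_change(cds: str, offset: int, alt_base: str) -> str:
    """Describe the effect of replacing cds[offset] with alt_base.

    Returns labels in the style used in the tables of this paper,
    e.g. 'K2E', '2 Silent', or 'K2*' for a new stop codon.
    """
    codon_start = offset - offset % 3
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = list(ref_codon)
    alt_codon[offset % 3] = alt_base
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE["".join(alt_codon)]
    residue = codon_start // 3 + 1  # 1-based residue number
    if ref_aa == alt_aa:
        return f"{residue} Silent"
    return f"{ref_aa}{residue}{alt_aa}"


toy_cds = "ATGAAAGGT"  # hypothetical ORF fragment encoding Met-Lys-Gly
print(amino_acid_change(toy_cds, 3, "G"))  # AAA -> GAA
print(amino_acid_change(toy_cds, 5, "G"))  # AAA -> AAG
print(amino_acid_change(toy_cds, 3, "T"))  # AAA -> TAA
```

A gene on the reverse strand would first need to be reverse-complemented, which this sketch leaves out.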

2.6. Capillary sequencing with the Sanger method for SNP/indel confirmation

Genomic regions of about 200 bases around the SNPs and indels called by the mapping
programmes were amplified by PCR and sequenced on a capillary sequencer with the
Sanger method using the commercial sequencing service of MACROGEN (Tokyo, Japan).
To confirm the SNPs located near IS elements or repetitive regions, longer DNA
fragments were amplified to avoid the amplification of other homologous regions in
the Synechocystis genome. The primers used for confirmation are listed in
Supplementary Table S1.

2.7. Uploading the genome sequence in the database
Short-read data, obtained on a Genome Analyzer II (Illumina Inc., San Diego, CA,
USA), of the substrains of Synechocystis sp. PCC 6803, PCC-P, PCC-N, and GT-I, were
deposited in the DRA (DDBJ Sequence Read Archive;
http://trace.ddbj.nig.ac.jp/DRASearch/); the accession number is DRA000401. The
genome sequences and gene annotations of the substrains GT-I, PCC-P, and PCC-N were
also deposited in the DDBJ/GenBank/EMBL database with the following accession
numbers: GT-I (AP012276), PCC-P (AP012278), and PCC-N (AP012277).



2.8. Phylogenetic analysis of Synechocystis sp. PCC 6803 substrains
The phylogenetic relationship of the various strains was estimated by the maximum
parsimony method, assuming that each base change and each indel is treated as a
single event. The computation was performed with the dolpenny software of the
PHYLIP package version 3.67, 28 using the polymorphism option. Each branch length
was set as the number of events occurring along the branch.



3. Results 

3.1. Analytical scheme applied to the massive short-read data obtained by
next-generation sequencing
The amplified DNA library was sequenced on the GAII with 50-base paired-end methods
using the parameter settings described in the Materials and methods section. We
obtained 250, 257, and 221 Mb of read data for the GT-I, PCC-N, and PCC-P
substrains, respectively (Table 1). These read depths correspond to more than 60
times the genome size of Synechocystis. Read data were mapped using three analyses
to identify the genomic positions of the SNPs, indels, and rearrangements (Fig. 1):
(i) BWA 21 and MAQ 20 for mapping analysis using raw read data; (ii) Velvet 24 and
MUMmer 25 for mapping analysis using de novo assembled contigs; and (iii)
BreakDancer 23 for rearrangement analysis such as IS movement. The mutations called
by each programme were passed through the filter



Table 1. Summary of mapping analyses using the read data (BWA, MAQ) or the de novo
assembled contigs (Velvet and MUMmer)

Synechocystis sp. PCC 6803 substrains          GT-I     PCC-N    PCC-P
Total read bases (Mb)                          250      257      221
Averaged read depth                            70       72       62
Genome coverage (%)                            99.99    99.99    99.99

Number of SNPs and indels called by each programme (final number of
differences/number of differences including false-positive data)
  MAQ                                          16/76    26/78    23/89
  BWA                                          19/69    32/79    28/75
  Velvet and MUMmer                            22/85    33/104   29/109
  BreakDancer                                  3/3      3/3      3/3

Final number of differences to the database    28       44       39

[Figure 1 flowchart: DNA library with approximately 300-bp fragments of the
Synechocystis genome -> sequenced data with 50-base paired-end reads -> three
branches: (i) mapping analysis by BWA and MAQ -> SNP and indel data; (ii) de novo
assembly by Velvet -> contig mapping by MUMmer -> SNP and indel data; (iii)
rearrangement finding by BreakDancer -> rearrangement positions. All branches ->
confirmation by Sanger sequencing -> comparison of SNP/indel positions among
substrains and the database.]

Figure 1. Analytical scheme of the read data obtained by massive parallel
sequencing. The preparation of the DNA library is described in the Materials and
methods section. The mapping programmes BWA and MAQ were used for short-read data;
the de novo assembly programme was Velvet, and MUMmer was the mapping programme for
assembled contigs.



settings of the SAMtools program 22 (Table 1).
Distributions of the averaged read depth in each 1-kb window along the entire
genome, obtained by BWA and MAQ, are shown in Supplemental Fig. S1. A read depth of
at least 15 was obtained even in the region with the lowest read depth. The figure
also indicates that the pattern of the read depth distribution depends on the
algorithm of each mapping programme. The mapping programmes MAQ, BWA, and MUMmer
called 76, 69, and 85 potential mutation points, respectively, for the GT-I strain
as primary data, 89, 75, and 109 points for the PCC-P strain, and 78, 79, and 104
points for the PCC-N strain (Table 1). To confirm these results, we checked the
sequence around all these loci by Sanger sequencing; regions of ~200 bp were
amplified around the position of the mutations. Depending on the parameter
settings, read depth, and sequence specificity, SNPs common to all three substrains
were sometimes detected in only one or two substrains by a given programme. Even
when an SNP was called in only one strain, we performed Sanger sequencing of the
same locus in all three strains. All oligo DNA primers prepared for the PCR
reactions are listed in Supplemental Table S1.
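
The averaged read depths in Table 1 follow from dividing the total read bases by the genome size. The sketch below reproduces that arithmetic under one stated assumption: a chromosome length of 3,573,470 bp for Synechocystis sp. PCC 6803, which is not given in this section and is supplied here only for illustration.

```python
# Assumed chromosome length (bp); not stated in this section.
GENOME_SIZE_BP = 3_573_470

# Total read bases (Mb) per substrain, from Table 1.
total_read_bases_mb = {"GT-I": 250, "PCC-N": 257, "PCC-P": 221}


def averaged_depth(total_mb: float, genome_bp: int = GENOME_SIZE_BP) -> float:
    """Average number of times each genome position is covered by read bases."""
    return total_mb * 1_000_000 / genome_bp


depths = {strain: round(averaged_depth(mb)) for strain, mb in total_read_bases_mb.items()}
print(depths)
```

Under that assumption the rounded values agree with the averaged read depths of 70, 72, and 62 listed in Table 1.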

3.2. Combinatorial use of the mapping programmes contributes to the identification
of SNPs and indels
Confirmation by Sanger sequencing revealed a number of false-positive SNP and indel
calls. The final numbers of SNPs/indels in each substrain were only about 20-40% of
the numbers called by each programme, as shown in Table 1. Most of the
false-positive SNPs or indels were located in repetitive regions or in highly
homologous genes in the Synechocystis genome. We expected that a 60% cut-off value
in the mapping programmes would be sufficient to detect heterogeneous SNPs in
Synechocystis, which has a multi-copy genome, because 80% of the SNPs in the GT-I
strain were called as heterogeneous SNPs by BWA. However, we could not confirm any
of them as heterogeneous SNPs by the Sanger method. This indicates that there are
technical problems in estimating the total number of heterogeneous SNPs in the
whole genome from cut-off value settings or base-call percentage data obtained with
these mapping programmes. On the other hand, all 16 SNPs called as homogeneous SNPs
by BWA in the GT-I strain were also confirmed by the Sanger method. The total
number of mutations confirmed by the Sanger method is shown in Fig. 2, with the
number of mutations including false-positive data in parentheses. These numbers are
the sum of the results obtained for the three substrains. We found that the
combinatorial use of several programmes is necessary for the comprehensive
detection of SNPs/indels and for the identification of mutations. Mutation loci
detected commonly by all three programmes were more reliable than those detected by
only a single programme (Fig. 2). The difference in the distribution pattern of the




[Figure 2 diagram; recoverable label: BreakDancer, 3 (IS)]

Figure 2. Diagram of the mutations identified by each programme. The number of
mutations (SNPs and indels) confirmed by the Sanger method is shown in each circle,
with the number of mutations including false-positive data in parentheses. The
number of mutations detected by more than one programme is indicated in the
overlapping region of the circles. A threshold (cut-off) value of 60% was used in
the mapping programmes BWA and MAQ (see Materials and methods section). Mutations
detected commonly by all three programmes were more reliable. The combinatorial use
of the mapping programmes is important for the genome-wide identification of the
mutation loci. Numbers labelled with an asterisk contain miscalled results,
indicated by parentheses in Tables 2 and 3.



Table 2. List of the genomic loci of SNPs and indels found in all GT-I, PCC-P, and PCC-N strains compared with the nucleotide sequence in the database

Genomic loci | Type | Database | GT-Kazusa / GT-S / GT-I / PCC-P / PCC-N | Quality score | Source | Gene ID | Annotation | Amino acid change | Comment
943495 | SNP | C | A / A / A / A / A | 255 255 | MAQ BWA mummer | slr1834 | psaA | V604I | Smart and McIntosh (10). Error of the Database (27)
1012958 | SNP | | | 255 255 | MAQ BWA mummer | intergenic region ssl3177-sll1633 | repA-ftsZ | | Error of the Database (27)
1200143-1201488 (1200306) | Indel (SNP) | IS (C) | IS | (108 233) 99 | (MAQ BWA) BreakDancer | sll1780 | ISY203b | | Insertion of transposase (14). GT-Kazusa specific (27). MAQ and BWA detected this indel region as SNP
1364187 | SNP | | | | MAQ BWA mummer | sll0838 | pyrF | Silent | Error of the Database (27)
2092571 | SNP | | | | MAQ BWA mummer | sll0422 | sll0422 | L313* Stop codon | Error of the Database (27)
2198893 | SNP | T | | | MAQ BWA mummer | sll0142 | sll0142 | 689 Silent | Error of the Database (27)
2204584 | Indel | G | | | mummer | slr0162 | gspF | gspF+pilC | G-insertion in GT-Kazusa strain causes split of the original pilC gene (34). GT-Kazusa specific (27)
2301721 | SNP | A | | | MAQ BWA mummer | slr0168 | slr0168 | K403E | Error of the Database (27)
2350285-2350286 | Indel | - | | | BWA mummer | intergenic region sml0001-slr0363 | psbI-slr0363 | | Error of the Database (27)
2360245-2360246 | Indel | - | | | BWA mummer | slr0364 | slr0364 | Frameshift | Error of the Database (27)
2409244 | Indel | C | | | mummer | sll0762 | sll0762 | Frameshift | Error of the Database (27)
2419399 | Indel | T | | | BWA mummer | sll0752 | sll0752 | Frameshift | Error of the Database (27)
2544044-2544045 | Indel | - | | | BWA mummer | ssl0787 | ssl0787 | Frameshift | Error of the Database (27)
2602717 | SNP | C | | | MAQ BWA mummer | slr0468 | slr0468 | H82Q | Error of the Database (27)
2602734 | SNP | T | | | MAQ BWA mummer | slr0468 | slr0468 | I88N | Error of the Database (27)

Continued
Table 2. Continued

Genomic loci | Type | Database | GT-Kazusa / GT-S / GT-I / PCC-P / PCC-N | Quality score | Source | Gene ID | Annotation | Amino acid change | Comment
2748897 | SNP | | | 255 255 | MAQ BWA mummer | intergenic region slr0210-ssr0332 | slr0210-ssr0332 | | Error of the Database (27)
3142651 | SNP | | | 255 255 | MAQ BWA mummer | sll0045 | sps | 75 Silent | Error of the Database (27)
3260096 | | | | | Mummer | intergenic region sll0529-sll0528 | sll0528 | | GT-Kazusa specific (27)
3400322-3401506 | | | | | BreakDancer | sll1474 | ISY203g | sll1473+sll1475 | Insertion of transposase. IS-insertion causes split of the original hik32 gene (14). GT-Kazusa specific (27)

Genomic loci | Type | Database | GT-Kazusa / GT-S / GT-I / PCC-P / PCC-N | Quality score | Source | Gene ID | Annotation | Amino acid change | Comment
386410-386411 (386406) | Indel (SNP) | - (T) | 102 bp in GT-I, PCC-P, and PCC-N; (A) (A) | (68) | (MAQ) | slr1084 | slr1084 | 34 amino acids deletion (V77D) | This indel region was called as SNP by MAQ as shown in parentheses. This indel region was not detected in the GT-S strain (27). CTGGGGGAAAAATGTTGGATTGATAACCTCGCCCCGGTTACCATTGAGTCCCATGTGTGTATTTCCCAGGGCGTTTACCTATGCACTGGCAACCACGATTGG
1192983 | SNP | A | A / A / C/A / C/A / C/A | | Mummer | slr1855 | slr1855 | T167P | Potential heterogeneous nucleotide (intensities of the peaks due to C and A were almost equal). This SNP was not detected in the GT-S strain (27)
2048341-2049583 | Indel | IS | IS in GT-Kazusa and GT-S | | BreakDancer | slr1635 | ISY203e | | Insertion of transposase (14). Specific IS in GT-Kazusa and GT-S strains (27)

The left column shows the genomic locus of each mutation in the database (NCBI accession number NC_000911). Quality scores indicate the phred-scaled
scores called by MAQ and BWA, respectively. The quality score given by BreakDancer is a software-original value. The upper table lists the mutations that were
suggested to be errors of the database, together with the GT-Kazusa strain-specific mutations such as ISY203b, ISY203g, and the locus 2204584 (31). The lower
table shows additional differences found only in the GT-I, PCC-P, and PCC-N strains. Greyed columns emphasize the different sites and their details. Several indel
regions miscalled as SNPs by MAQ and BWA are shown in parentheses.



Table 3. List of the genomic loci of SNPs and indels found in the specific strains compared with the nucleotide sequence in the database 



Genomic loci 


Type 


Data base 


CT- 
Kazusa 


GT-S 
strain 


GT-l 
strain 


PCC-P 
strain 


PCC-N 
strain 




Quality 
score 


Source 


Gene ID 


Annotation 


Amino acid 
change 


Comment 


1 26257 


SNP 


C 


C 


C 


C 




T 




255 255 


MAQ BWA 


SII0698 


hik33 


D63N 


Different site between 






















mummer 








GT strains and PCC 
strains 


























731 367 


Indel 


T 


T 


T 


T 








1 55 


BWA 

mummer 


sill 574 


sill 574 


sll1574+slllS75 


T insertion in GT strains 
causes gene split of the 
original spkA gene. 
Different site between 
GT/ATCC strains and 
PCC strains (1 2) 



781 625- Indel 
781626 



1 300941 - 
1 300985 
(1 300977) 



1 81 241 9 



Indel (SNP) 45 bp 



1423340- Indel 
1423341 



SNP 



45 bp 



45 bp 



154 bp 154 bp 






45 bp (C) 








intergenic slr2030- 
region slr2031 
slr2030- 
slr2031 



255 255 MAQ BWA intergenic infA-adk 
mummer region 

ssl3441- 
SII181S 

255 255 MAQ BWA slr!86S slr186S 
mummer 

(164 (MAQ BWA) slr1819 siri 81 9 

255) 



386 BWA 

mummer 



sill 951 



255 255 MAQ BWA slr1993 
mummer 

255 255 MAQ BWA slr1983 
mummer 

255 255 MAQ BWA slr0222 
mummer 



sill 951 



till 



slr1983 



slr0222 



1 5 amino acids 
deletion 



N1 438* Stop 
codon 



A225V 



Different site between 

GT strains and PCC 

strains (1 3). TTTAAAC 

GTCATGCACCAATCTC 

TGATTTACTGGTTTATTC 

ATCTATCAATTCCATAGCC 

TTTTTGCTTCATCGCTCC 

AACTAACTTTTC 

TGGGATGTCCTCC 

ATGCCCCCCGTGCCTAGC 

TTACCGTCCACCGATGCC 

GTTATTCCCCCCGGC 

Different site between 
GT strains and PCC 
strains 

Different site between 
GT strains and PCC 
strains 

Putative PCC strains- 
specific 4 5 bp deletion 
without frameshift. This 
indel region was called 
as SNP by MAQ and BWA 
as shown in parentheses. 
GGGCTATCCTGC 
GGGATACCGACATGACCC 
TGGCCACTCTCCAGG 

Different site between 
GT strains and PCC 
strains 

Different site between 
GT strains and PCC 
strains 

Different site between 
GT strains and PCC 
strains 

Different site between 
GT strains and PCC 
strains 






Continued 



Table 3. Continued 



Genomic loci 


Type 


Data base GT- 

Kazusa 


GT-S 
strain 


GT-I 
strain 


PCC-P 
strain 


PCC-N 
strain 


Quality 
score 


Source 


Gene ID 


Annotation 


Amino acid 
change 


Comment 


2736514- 
273651 5 


Indel 








T 


T 


257 


BWA 

mummer 


sli0 1 82 


sIl0 1 82 


Frameshift 


Different site between 
GT strains and PCC 
strains 



301 4665 



30961 87 



Genomic loci 

387006 

842060 

909360 

1 392586 

1470212 

1764198 

Genomic loci 

1 2521 8 

1437136 

2674108 

69849 

1 25262- 
1 25273 



1 763998 



SNP 



SNP 



Type 
SNP 
SNP 
SNP 
SNP 
SNP 
SNP 
Type 
SNP 
SNP 
SNP 
SNP 
Indel 



Data base GT- 

Kazusa 



Data base GT- 

Kazusa 



C 
C 

12 bp 



C 
C 

12 bp 



GT-S 
strain 

C 
C 



GT-S 
strain 



C 
C 

12 bp 



CT-I 
strain 



SNP 




255 255 MAQ BWA slr0302 
mummer 



MAQ 



MAQ BWA 
mummer 

MAQ BWA 
mummer 

MAQ BWA 
mummer 

MAQ BWA 
mummer 

MAQ BWA 
mummer 

MAQ BWA 
mummer 



MAQ BWA 
mummer 

MAQ BWA 
mummer 

MAQ BWA 
mummer 

MAQ BWA 
mummer 



MAQ BWA 
mummer 

MAQ BWA 



ssr1175 



217 224 MAQ BWA ssr1 1 76 



Gene ID 
slr1085 
sin 799 
sill 968 
slr12S0 
sill 60S 
slr1962 
Gene ID 
si 10 69 8 
sir 1992 
slr0645 
slr1 1 19 
si 10 69 8 

slr1 510 
slr1962 



slr0302 



ssr1 175 



slr1085 

rpIC 

pmgA 

pstB 

fabZ 

slr1962 

Annotatit 

hil<33 

slr1 992 

slr0645 

slr1 119 

hil<33 

pIsX 
slr1962 



92 Silent 



I47T 



Amino acid 
change 

P1 09L 

R1 85Q 

E93K 

L204S 

R46C 

F1 58C 

Amino acid 
change 

T409M 
146 Silent 
A3V 
R1 89Q 

Four amino acids 
deletion 



E91 D 



Different site between 
GT strains and PCC 
strains 

Different site between 
GT strains and PCC 
strains 

Different site between 
GT strains and PCC 
strains. Potential 
heterogenous nucleotide 
(Small T peak was also 
detected in the PCC 
strains) 

Comment 



GT-l strain-specific 

GT-l strain-specific 

GT-l strain-specific 

GT-l strain-specific 

GT-l strain-specific 

GT-l strain-specific 

Comment 

PCC-P strain-specific 

PCC-P strain-specific 

PCC-P strain-specific 

PCC-N strain-specific 

PCC-N strain-specific 
1 2base deletion without 
frameshift. 
CTGGGTCAACAT 

PCC-N strain-specific 
PCC-N strain-specific 
read depth in each mapping programme also suggests that the combined use of several
programmes is more adequate (Supplemental Fig. S1). However, it is worth noting
that SNPs/indels identified commonly by the three programmes covered only 60% of
the total number of mutations. The correct detection of IS movements reported by
Okamoto et al. 14 was possible only with BreakDancer, a programme developed for the
detection of genome rearrangements. 23
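
The statement that the final numbers of SNPs/indels were only about 20-40% of the numbers called by each programme can be checked directly against the confirmed/called pairs in Table 1. The short sketch below does this; the dictionary layout and variable names are illustrative, and the BreakDancer row (3/3 for every substrain) is left out because it concerns rearrangements rather than SNP/indel calls.

```python
# (confirmed, called) pairs per programme and substrain, from Table 1.
table1 = {
    "MAQ": {"GT-I": (16, 76), "PCC-N": (26, 78), "PCC-P": (23, 89)},
    "BWA": {"GT-I": (19, 69), "PCC-N": (32, 79), "PCC-P": (28, 75)},
    "Velvet and MUMmer": {"GT-I": (22, 85), "PCC-N": (33, 104), "PCC-P": (29, 109)},
}

fractions = {
    (programme, strain): confirmed / called
    for programme, per_strain in table1.items()
    for strain, (confirmed, called) in per_strain.items()
}

lowest = min(fractions, key=fractions.get)
highest = max(fractions, key=fractions.get)
for key in (lowest, highest):
    print(key, f"{fractions[key]:.1%}")
```

The confirmed fraction ranges from roughly 21% (MAQ, GT-I) to roughly 41% (BWA, PCC-N).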

3.3. Comparison of the identified SNPs/indels and the genome sequence in the
database
The confirmed mutations are listed in Tables 2 and 3. The three substrains, GT-I,
PCC-P, and PCC-N, manifested at least 22 common different sites compared with the
sequence of the Kazusa strain in the database. Among these sites, 15 differed from
the database sequence but were not real differences in the genomic loci, as
revealed by Tajima et al. 27 (Table 2). We also found a total of 14 mutations
between the GT/Kazusa and the PCC-P/PCC-N strains (Table 3). The PCC-P and PCC-N
substrains contained three and eight additional specific mutations, respectively
(Table 3). These may be potential mutations that elicited the known difference in
the phototactic phenotype of the PCC substrains. For example, the PCC-N substrain
has a mutation in the gspE2 (pilB2) gene for pilus assembly, which also moderately
affects the transformation efficiency. 15 The PCC-N strain also has a 12-base
deletion, without a frameshift, in the kinase domain of the hik33 gene, which
encodes a histidine kinase. Hik33 is the multi-stress sensor in Synechocystis and
it is conserved in all cyanobacterial species. 29-31 This suggests that this
substrain may have lost the Hik33-dependent regulation of global gene expression,
although the relationship between hik33 and phototaxis remains to be determined.

Some of the indel regions were miscalled as SNPs by the mapping programmes. We
aligned the genome sequences of the substrains around the indels (Fig. 3) and found
that these indels were located in the middle of two direct-repeat or
direct-repeat-like sequences. Interestingly, the direct repeats found in the
deleted regions did not share a common sequence. Mapping analyses using de novo
assembled contigs (Velvet and MUMmer) help to correct the false-positive SNP/indel
calls made by mapping programmes such as BWA and MAQ (Table 2).
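
To illustrate how a deletion can be located by aligning a longer and a shorter sequence, and why its exact position can be ambiguous when the flanks share bases, the following sketch uses the hik33 region shown in Fig. 3B. It is a simplified stand-in for a real aligner such as MUMmer, not a reproduction of it; the function name and return convention are hypothetical. The sequences are assembled from the flanks printed in Fig. 3B and the 12-base deleted segment listed in Table 3.

```python
def locate_deletion(reference: str, shorter: str):
    """Locate a single deletion that turns `reference` into `shorter`.

    Returns (deleted_length, leftmost_start, rightmost_start) as 0-based
    start positions in `reference`. When the bases at the junction are
    repeated, several start positions are equally valid.
    """
    deleted = len(reference) - len(shorter)
    if deleted <= 0:
        raise ValueError("second sequence must be shorter than the first")
    prefix = 0
    while prefix < len(shorter) and reference[prefix] == shorter[prefix]:
        prefix += 1
    suffix = 0
    while suffix < len(shorter) and reference[-1 - suffix] == shorter[-1 - suffix]:
        suffix += 1
    if prefix + suffix < len(shorter):
        raise ValueError("sequences differ by more than one simple deletion")
    leftmost = max(0, len(shorter) - suffix)
    rightmost = min(prefix, len(shorter))
    return deleted, leftmost, rightmost


left_flank = "AAATTTTCCCTGTTCTGATCCAACAC"
deleted_in_pcc_n = "CTGGGTCAACAT"
right_flank = "CAGACGAATGGTGCGGGGAAACG"

other_strains = left_flank + deleted_in_pcc_n + right_flank
pcc_n = left_flank + right_flank

result = locate_deletion(other_strains, pcc_n)
print(result)
```

The deletion is 12 bases long, and its start can be placed at either of two adjacent positions because the first deleted base and the first base after the deletion are both C. A read-mapping programme that does not open a gap at such a junction may report the mismatches as SNPs instead.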

Some of the mutations confirmed by the Sanger 
method were putative heterogeneous SNPs. At these 
SNP loci, we detected a second peak from the other 
nucleotide. As the Synechocystis genome is multi- 
copy, some of the genome copies may harbour 
A Upstream region of the slr2031 gene

> PCC-P and PCC-N TTTGCTCAAACCATTTGGTAAAACTGCTCAATGGACGAGCCGATTTTCACCCCGGCTTTA 

> GT strains TTTGCTCAAACCATTTGGTAAAACTGCTCAATGGACGAGCCGATTTTCACCCCGGC 

******************************************************** 

AACGTCATGCACCAATCTCTGATTTACTGGTTTATTCATCTATCAATTCCATAGGCTTTT 



TGCTTCATCGCTCCAACTAACTTTTCTGGGATGTCCTCCATGCCCCCCGTGCCTAGCTTA 



CCGTCCACCGATGCCGTTATTCCCCCCGGCAATTTTGTTGAACCTCCCACTTCTCCGGTG 

AATTTTGTTGAACCTCCCACTTCTCCGGTG 

****************************** 



B hik33 

> Other Strains AAATTTTCCCTGTTCTGATCCAACACCTGGGTCAACATCAGACGAATGGTGCGGGGAAACG 

> PCC-N AAATTTTCCCTGTTCTGATCCAACAC CAGACGAATGGTGCGGGGAAACG 

************************** *********************** 



C slr1819

> GT strains GGCCGGAGCCGATCTACGCAGTGCCAATTTTCACGGGGCCATGCTCCAGGGGGCTATCCTG 

> PCC-P and PCC-N GGCCGGAGCCGATCTACGCAGTGCCAATTTTCACGGGGCCATGCTCCAGG 

************************************************** 



CGGGATAGCGACATGACCCTGGCCACTCTCCAGGATACGAATTTAATTGGGGCGGATCTAC 

ATACGAATTTAATTGGGGCGGATCTAC 

*************************** 



D slr1084

> Other strains AAATTCCCTTGGCGGCTAACCCTGGGCAATTACGTTTGGCTGGGGGAAAAATGTTGGATTG 

> Kazusa & GT-S AAATTCCCTTGGCGGCTAACCCTGGGCAATTACGTTTGG 

*************************************** 

ATAACCTCGCCCCGGTTACCATTGAGTCCCATGTGTGTATTTCCCAGGGCGTTTACCTATG 



CACTGGCAACCACGATTGGAGTAAACCCAGCTTTGACCTAATCACCAGTCCGATTCACATC 

AGTAAACCCAGCTTTGACCTAATCACCAGTCCGATTCACATC 

****************************************** 

Figure 3. Alignment of the specific indel regions whose consensus read bases were miscalled. (A) The 154-base deletion in the slr2031
gene 13 in the GT strains. (B) The 12-base deletion in the hik33 gene in the PCC-N strain. (C) The 45-base deletion in the slr1819 gene
in the PCC-P and PCC-N strains. (D) The 102-base deletion in the slr1084 gene in the GT-S and Kazusa strains. Deleted regions are
underlined and direct-repeat sequences are emphasized in grey. These deleted loci were situated in the middle of the
direct-repeat sequences.
[Figure 4 tree labels: Fresh water in Oakland, California (isolated by R. Kunisawa in 1968); PCC 6803, motile (Rippka et al., 1979);
PCC strains; positive/negative phototaxis (Yoshihara et al., 2000); PCC-N strain, 8 additional specific mutations; 3 additional specific
mutations; GT strains; glucose tolerant (Williams et al., 1988); 154 bp deletion in slr2031 (Katoh et al., 1995; Kamei et al., 1998);
complete genome sequence (Kaneko et al., 1996); 14 mutations between PCC and GT, including slr2031 and spkA mutations.]

Figure 4. Unrooted tree showing the phylogenetic relationship of various strains of Synechocystis sp. PCC 6803. Known events are indicated on each
branch. The number of mutations in each substrain relative to the database sequence (Kazusa strain) is indicated. The scale bar indicates the
branch length corresponding to the number of mutations.



heterogeneous SNPs. These loci were not in a multi- 
repeat region of the genome. 

A phylogenetic scheme of the history of the
Synechocystis sp. PCC 6803 substrains is presented in
Fig. 4. The predicted phylogenetic relationship of the
various strains indicates that the putative root may
lie between the PCC branch and the GT branch. This
suggests that none of the existing substrains retains the
original sequence of this organism as isolated in 1968.



4. Discussion 

4.1. Problems associated with the heterogeneity
of frozen stocks of bacterial strains
Frozen stocks in culture centres are generally stored
as heterogeneous populations of cells of the particular
bacterium, such as E. coli K-12 MG1655. 32 Thus, when
aliquots of the frozen cells are thawed, the growth
conditions impose a selection bias that affects the major
genotype of the cells in culture. We posit that the
heterogeneity of cells in different laboratories has affected
the results of genetic research and phenotypic analyses,
and we suggest that future post-genomic research on
cyanobacteria should be based on resequenced strains that
serve as the new reference for each laboratory. In
post-genomic studies using various mutants, it is better to
prepare frozen stocks from single colonies, one by one, and
also from the parental cells, as soon as possible after the
isolation of the mutants. Long-term cultivation of the cells
on gels or in liquid culture is clearly not appropriate for
preserving the genotype of the cells.

4.2. Potential factors that affect phenotypic differences 
in 6803 substrains 

We found a small number of substrain-specific mutations
in the PCC-P and PCC-N strains. One or more of these
mutations is expected to account for the difference in the
phototactic phenotype between the PCC-P and PCC-N strains. 19
Furthermore, the 14 mutations that distinguish the
GT/Kazusa strains from the PCC-P/PCC-N strains suggest that
these loci may affect the cell motility of these strains 8
or their glucose tolerance. 3 Our findings may be
useful in functional studies on genes that are disrupted
by SNPs or indels in specific substrains. As
Kamei et al) 2 or Okamoto et al? 4 suggested, the real 
function of several genes can be studied only in specific
substrains that retain the original nucleotide sequence
without any SNPs, indels, or IS insertions.

4.3. Indels located between two direct repeats result
in miscalls by the mapping programme

Confirmation of our results with the Sanger
method revealed that specific deletion patterns
were miscalled or undetected at high frequency by
mapping analysis of short-read data.
Several deleted regions in the middle of direct-repeat
sequences were miscalled as SNPs or went
undetected (Tables 2 and 3). Alignment of the
DNA sequences of the three substrains clearly
showed direct repeats on both sides of the deletion
locus (Fig. 3). After the deletion event only a single
copy of the direct repeat remained, leading to the
hypothesis that the deletion was due to self-crossover.
At present we do not know what mechanisms or
factors trigger these deletions, but caution should
be exercised during long-term cultivation
of cyanobacterial cells. These miscalls also expose a
technical limitation of mapping analysis in detecting the
exact positions of SNPs and indels from the massive
short-read data of next-generation sequencers.
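
The ambiguity created by a deletion between direct repeats can be illustrated with a short Python sketch. It is not the algorithm used by BWA or MAQ; the toy sequences and the function names (delete_between_repeats, deletion_placements) are hypothetical and chosen only for illustration. The sketch removes one repeat copy together with the intervening sequence, as the self-crossover hypothesis predicts, and then lists every start position at which the same deletion could be placed.

```python
def delete_between_repeats(seq, repeat):
    """Model a self-crossover: drop one repeat copy plus the sequence
    lying between the first two copies of `repeat` (toy model)."""
    first = seq.find(repeat)
    second = seq.find(repeat, first + len(repeat)) if first >= 0 else -1
    if second < 0:
        raise ValueError("sequence must contain two copies of the repeat")
    return seq[:first] + seq[second:]


def shared_prefix(a, b):
    """Length of the common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def deletion_placements(ref, alt):
    """Return (deletion length, leftmost start, rightmost start) for a
    single contiguous deletion that turns `ref` into `alt`.
    Start positions are 0-based offsets in `ref`."""
    del_len = len(ref) - len(alt)
    if del_len <= 0:
        raise ValueError("alt must be shorter than ref")
    prefix = shared_prefix(ref, alt)
    suffix = shared_prefix(ref[::-1], alt[::-1])
    leftmost = max(0, len(alt) - suffix)
    rightmost = min(prefix, len(alt))
    if leftmost > rightmost:
        raise ValueError("difference is not one contiguous deletion")
    return del_len, leftmost, rightmost


# Hypothetical toy locus: left flank, repeat, unique segment, repeat, right flank.
left, repeat, unique, right = "TTGACA", "GGCAT", "ACCTTAGT", "CTAGTC"
ref = left + repeat + unique + repeat + right
alt = delete_between_repeats(ref, repeat)

del_len, lo, hi = deletion_placements(ref, alt)
print("deleted bases:", del_len)
print("equivalent start positions:", list(range(lo, hi + 1)))
print("sequence spanned by the ambiguity:", ref[lo:hi])
```

In this toy case the deletion can start anywhere within the first repeat copy and still yield the same product, so the number of equivalent placements is the repeat length plus one. A short read that ends inside the repeat therefore carries no information about where the deletion lies, which is consistent with such loci being reported as SNPs or missed altogether.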



4.4. Problems raised by heterozygous 
and homozygous SNPs 

In this analysis, we found that BWA and MAQ detected a
SNP only when more than 60% of the reads covering the
position supported it.
However, Synechocystis sp. strain PCC 6803 has a
multi-copy genome, 33 suggesting that the genome
contains unidentified heterozygous SNPs below this
threshold value. It is technically difficult to identify
all heterozygous SNPs in the genome; if we lower
the threshold, the number of false-positive SNPs
increases drastically. Mutations hidden as minor
heterozygous SNPs may become major SNPs under specific
conditions and affect the cell phenotype, as
reported by Hihara and Ikeuchi 16 and
Hihara et al., 17 who studied the pmgA mutant. The
active retention of heterogeneity may be a strategy
of cyanobacteria for acclimation to environmental
changes. Heterogeneous SNPs found at the same
locus in several substrains, such as loci
1192983 and 3098707, are candidates for
understanding such mechanisms. The detection of
minor heterozygous SNPs remains a future problem for
mapping using massively parallel sequencing.
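
The effect of the threshold can be sketched in Python as follows. This is a minimal illustration and not the calling procedure of BWA or MAQ: the function name call_site, the status labels, the read pileups, and the genome copy numbers are all hypothetical. Only the 60% cut-off comes from the text above.

```python
from collections import Counter


def call_site(ref_base, read_bases, threshold=0.6):
    """Classify one genome position from the bases of the reads covering it.

    Returns (status, alt_base, alt_fraction). A SNP is reported only when
    the most frequent non-reference base exceeds `threshold`."""
    counts = Counter(read_bases)
    depth = sum(counts.values())
    if depth == 0:
        return "no_coverage", None, 0.0
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return "reference", None, 0.0
    alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
    fraction = alt_n / depth
    status = "snp" if fraction > threshold else "minor_variant"
    return status, alt_base, fraction


def expected_fraction(mutant_copies, total_copies):
    """Expected read fraction for a mutation carried by some genome copies,
    assuming all copies are sequenced equally (idealised)."""
    return mutant_copies / total_copies


# Hypothetical pileups: 70% versus 30% of reads carry the variant base G.
print(call_site("A", "G" * 70 + "A" * 30))
print(call_site("A", "G" * 30 + "A" * 70))

# Hypothetical cell with 12 genome copies, 3 of which carry a mutation.
print(expected_fraction(3, 12))
```

Under these assumptions, a mutation present in a minority of genome copies yields a read fraction below the cut-off and is never reported as a SNP, although it could become the major allele after selection.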

In this study, we identified a number of differences
between the genome sequences of laboratory strains and
the published sequence data derived from the Kazusa
strain. Resequence data on the PCC substrains will be
useful for considering intra-species evolutionary events
in Synechocystis. Resequence analysis is an effective
basis for rebuilding the sequence information used in
post-genomic studies on Synechocystis sp. PCC 6803, and
it represents a powerful genetic strategy for identifying
potential mutation loci in spontaneous mutants with
altered phenotypes.